ABSTRACT

This  paper  describes  an  algorithm  which,  for  suitable  grammars, 
maps  the  Backus  Naur  Form  (BNF)  definition  of  the  grammar  of  a 
language  into  a  parser  for  the  sentences  in  that  language.   By  design 
the  algorithm  generates  a  suitable  parser  for  any  bounded  right  context 
grammar.   It  happens  that  it  also  covers  some  LR(k)  grammars  which  are 
not  bounded  right  context.  A  modified  version  of  Floyd's  descriptive 
language  for  symbol  manipulation  is  used  to  describe  the  parser. 
Several  examples  illustrate  the  application  and  generality  of  the 
algorithm. 
Introduction

The algorithm described herein is in essence an extension,
albeit a simplification, of the work of Earley, which in turn was
based on Evans, Feldman, Floyd, and Standish. For a
large subset of grammars, the algorithm maps the Backus Naur Form (BNF)
definition of the grammar of a language into a deterministic, left-to-
right parser for the sentences in that language. It is shown below
that the algorithm, by design, covers all bounded right context grammars
and, as a by-product, some LR(k) grammars which are not bounded right
context (see Knuth for the definitions of these classes of grammars).

More  precisely,  the  algorithm  maps  a  set  of  BNF  productions 
into  a  program:  a  "reductions  analysis"  program*  consisting  of  modified 
Floyd  productions  (really  reductions)  referred  to  below  as  FPL  (Floyd 
Production  Language)  statements.**  The  program  consists  of  labeled, 
mutually  exclusive  groups  of  statements  called  sections.  Each  section 
has  a  specific  task  to  perform.   It  is  activated,  by  transfer  of  control 
to  its  first  statement  via  the  label,  only  at  appropriate  times.  Upon 
each  activation  it  either  scans  a  new  terminal  symbol  or  makes  a  reduction 
(combined  with  an  "unscan"  in  the  case  of  a  production  with  an  empty 
right  part)  and  then  transfers  control  to  the  appropriate  next  section, 
or  it  transfers  control  to  an  ERROR  routine  if  control  falls  out  the 
bottom  of  the  section. 

The  algorithm  is  based  on  Earley's  intuitive  notion  that  the 
top  symbols  on  the  stack  matched  against  the  right  parts  of  certain 
productions  should  determine  parsing  decisions.   It  is  an  extension  of 
his  algorithm  in  that  it  provides  for  both  finite  look-ahead  and  finite 
look-back  and  in  that  it  covers  productions  with  empty  right  parts. 
It is a simplification of his algorithm in that it allows reductions
only at the top of the stack, therefore reducing the number of mapping
rules.


*It is assumed that the reader is familiar with reductions
analysis programs and the associated stack, input string, and
manipulations thereupon.

**This  nomenclature  is  adapted  to  clarify  the  distinction  between 
the  BNF  productions,  which  together  define  the  grammar  of  a  language,  and 
the FPL statements which, when combined to form a program, describe a parser.

A word on notation is in order before proceeding. In this paper,
non-terminal symbols are represented by Latin capitals, terminals by
lower case Latin letters, arbitrary strings by the Greek letter α, and
the empty string by the Greek letter ε. The Greek letter σ designates
a symbol which matches any other symbol.

The  Algorithm 

The algorithm consists simply of: (a) three rules to
determine what sections are necessary for the program, (b) three
corresponding rules to determine which productions should be mapped into
statements for each section, (c) four rules to map the productions into
statements, (d) a rule which prescribes the combination of some
statements in a given section and a corresponding combination of certain
sections, and (e) a contextual analysis rule for expanding statements
so no two statements in a given section are both applicable to a given
stack and input string configuration. (The latter operation is referred
to below as making the statements disjoint.) Of course, there are also
several rules for optimizations, some of which are given toward the
end of the paper.

(a) Necessary Sections. A special START section and a special section
for  SUCCESS  EXIT  are  required  together  with  the  following: 

(1)  A  section  labeled  Nh  is  required  for  each  non-terminal  N 
which  appears  in  the  right  part  of  some  production  as  other  than 
the  first  symbol.   This  section  is  activated  whenever  one  of  the 
terminals,  which  may  begin  N,  is  supposed  to  be  at  the  top  of  the 
stack.   It  is  the  purpose  of  section  Nh  to  verify  that  one  of 
these  terminal  "head  symbols"  is  indeed  at  the  top  and,  depending 
upon  which  terminal  is  there,  to  take  appropriate  action  to 
commence,  and  perhaps  conclude,  a  reduction  to  N. 

(2) A section labeled t(π,p) is required for each occurrence of
a terminal as the p-th symbol in the right part of each
production π, where p ≥ 2. This section consists of exactly one
statement which compares the first p symbols of production π
with the top p symbols of the stack. It is activated only when
the match must occur for a well-formed string. Its purpose is
to verify the top symbol and to take appropriate action to
continue the parse.

(3) A section labeled Nt is required for each non-terminal
N which appears in the right part of some production. The
section is activated immediately after a reduction to N occurs
at the top of the stack. The statements in this section
indicate comparisons to the stack to determine which of the
production(s) in whose right part N appears is applicable to
the case at hand. A match determines the appropriate
subsequent action.

(b) Descriptor Sets. In order to generate the appropriate set of
statements for a given section, a descriptor set of pairs D = {(π,p), ...}
is associated with each section label. This descriptor set is
determined by investigating the productions and serves to indicate to which
part of which production(s) the mapping rules described below are to
be applied. The pair (π,p) points to the first p symbols of production
π as the stack comparison symbols of the corresponding statement. The
descriptor sets are determined as follows:

(1) D_Nh: Initially D_Nh is empty and the following recursive
procedure is applied. The right part of each production π that
defines N is examined. If it is empty, then (π,0) is added to
D_Nh; if it begins with a terminal, then (π,1) is added to D_Nh;
otherwise it begins with a non-terminal and the procedure is
applied to that non-terminal.

(2) D_t(π,p) contains exactly one pair (π,p).

(3) D_Nt: The right part of each production π is examined. If
the non-terminal N appears as the p-th symbol, then (π,p) is
in D_Nt.
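The three descriptor-set rules above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the dictionary representation of the grammar, the production numbering, the function names, and the convention that non-terminals start with a capital letter are all assumptions made here. The `seen` guard, which stops rule (1) from looping on a left-recursive production, is also an addition that the text does not spell out.

```python
from typing import Dict, Optional, Set, Tuple

Production = Tuple[str, Tuple[str, ...]]  # (left part, right part)
Pair = Tuple[int, int]                    # (production number π, position p)

# Hypothetical example grammar: S ::= aA|bB, A ::= cA|d, B ::= cB|d.
GRAMMAR: Dict[int, Production] = {
    1: ("S", ("a", "A")), 2: ("S", ("b", "B")),
    3: ("A", ("c", "A")), 4: ("A", ("d",)),
    5: ("B", ("c", "B")), 6: ("B", ("d",)),
}

def is_nonterminal(sym: str) -> bool:
    # Assumed convention: non-terminals begin with a capital letter.
    return sym[:1].isupper()

def head_set(n: str, prods: Dict[int, Production],
             seen: Optional[Set[str]] = None) -> Set[Pair]:
    """Rule (1): descriptor set for section Nh."""
    seen = set() if seen is None else seen
    if n in seen:                 # assumed guard against left recursion
        return set()
    seen.add(n)
    d: Set[Pair] = set()
    for pi, (left, right) in prods.items():
        if left != n:
            continue
        if not right:                         # empty right part
            d.add((pi, 0))
        elif not is_nonterminal(right[0]):    # begins with a terminal
            d.add((pi, 1))
        else:                                 # begins with a non-terminal
            d |= head_set(right[0], prods, seen)
    return d

def terminal_sets(prods: Dict[int, Production]) -> Dict[Pair, Set[Pair]]:
    """Rule (2): one singleton set per terminal at position p >= 2."""
    return {(pi, p): {(pi, p)}
            for pi, (_, right) in prods.items()
            for p, sym in enumerate(right, start=1)
            if p >= 2 and not is_nonterminal(sym)}

def tail_set(n: str, prods: Dict[int, Production]) -> Set[Pair]:
    """Rule (3): descriptor set for section Nt."""
    return {(pi, p)
            for pi, (_, right) in prods.items()
            for p, sym in enumerate(right, start=1)
            if sym == n}
```

For the example grammar, `head_set("S", GRAMMAR)` collects the pairs for the two productions of S that begin with a terminal, and `tail_set("A", GRAMMAR)` collects every position at which A occurs in a right part.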

(c) The BNF to FPL Mapping Rules. Presented in Table I are four rules
for mapping BNF productions into FPL statements. Together with the
descriptor sets they represent a naive first try at generating a parser
for the grammar. Implicitly, the rules assume there is no question about
which production applies to the case at hand but only what action is to
be taken by the parser next, given that a certain production is applicable.
It is the purpose of the last two rules of the algorithm to extend it to
cover a reasonable set of grammars by resolving confusion about which
production(s) may apply to different cases within a given section.

Table I

The BNF to FPL mapping rules. (α represents the first p
symbols of production π, σ is a symbol which matches any other symbol,
and q = p + 1.)

       BNF production (π,p)      maps into      FPL statement

(1)    M ::= αN ...                             α|  *        Nh
(2)    M ::= αb ...                             α|  *        t(π,q)
(3)    M ::= α                                  α|  →  M|    Mt
(4)    M ::= ε                                  σ|  →  M|σ   Mt

The  rules  of  Table  I  are  explained  intuitively  as  follows.   If 
the  first  p  symbols  of  the  right  part  of  production  n  are  at  the  top  of 
the  stack  and 

(1) if the (p+1)st symbol is a non-terminal N, then the parser
should scan (*) the next terminal and activate section Nh to
begin to reduce a substring to N.

(2) if the (p+1)st symbol is a terminal b, then the parser should
scan the next terminal and activate section t(π,q), where
q = p + 1, to verify that that terminal is indeed b and to
decide how to continue the parse.

(3) if the p-th symbol is last in the right part of the
production, then the parser should make a reduction (→) to the symbol
M defined by the production and activate section Mt to decide
how to continue the parse.

(4) if p = 0 (and, therefore, the right part of the production
is empty), then the parser should "unscan" the top symbol, push
an M onto the top of the stack, and activate section Mt to
decide how to continue the parse. (The symbol unscanned will
always be a terminal since this statement will appear only in
an Nh-type section, the activation of which is always immediately
preceded by a scan (see rule (1)).)
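The four mapping rules can be sketched as a single function from a descriptor pair (π,p) to a statement. This is an illustrative sketch only: the `Statement` tuple, the action names "scan", "reduce" and "unscan", the grammar used in the usage note, and the single-character symbols are assumptions introduced here, not notation from the paper.

```python
from typing import Dict, NamedTuple, Optional, Tuple

Production = Tuple[str, Tuple[str, ...]]  # (left part, right part)

class Statement(NamedTuple):
    compare: str            # symbols compared with the top of the stack
    action: str             # "scan", "reduce", or "unscan" (assumed names)
    result: Optional[str]   # symbol placed on the stack by a reduction
    goto: str               # label of the section activated next

def map_pair(pi: int, p: int, prods: Dict[int, Production]) -> Statement:
    """Map the pair (pi, p) to a statement following rules (1)-(4)."""
    left, right = prods[pi]
    if not 0 <= p <= len(right) or (p == 0 and right):
        raise ValueError("pair does not come from a descriptor set")
    alpha = "".join(right[:p])          # the first p symbols of the right part
    if not right:                       # rule (4): M ::= ε
        return Statement("σ", "unscan", left, left + "t")
    if p == len(right):                 # rule (3): M ::= α
        return Statement(alpha, "reduce", left, left + "t")
    nxt = right[p]                      # the (p+1)st symbol
    if nxt[:1].isupper():               # rule (1): next symbol is non-terminal
        return Statement(alpha, "scan", None, nxt + "h")
    return Statement(alpha, "scan", None, f"t({pi},{p + 1})")   # rule (2)
```

With the hypothetical grammar `{1: ("S", ("a", "A", "b")), 2: ("A", ("c",)), 3: ("A", ())}`, the pair (1,1) yields a scan into section Ah, (1,2) a scan into t(1,3), (1,3) a reduction to S followed by St, and (3,0) the unscan-and-push statement of rule (4).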

(d) Combinations. In general, a reductions analysis program generated
according to the above rules will contain sections in which some of the
statements are not disjoint. That is, the conflicting statements will
indicate stack comparisons (1) which are identical, or (2) the shorter
of which are identical to the top few symbols of the longer ones. Thus,
several statements may be applicable to a single stack and input string
configuration, and the parser is in some sense non-deterministic. To
render the parser deterministic it must be modified so it can either delay
or determine the decisions concerning which of the several similar
productions associated with the conflicting statements is applicable
in various cases. Decision delays are effected by pairwise statement
combinations as follows.

If a pair of statements in a given section are not disjoint
and if each was generated according to either mapping rule (1)
or (2), then replace them with a single statement: one whose
stack comparison is the shorter of the two and which, upon a
successful stack match, scans a new terminal and activates a
new combination section which must be added to the program.
The new section is that section whose descriptor set is the
union of the two descriptor sets of the sections which the
original statements would have activated. Of course, the new
section must be checked for disjointness, and the old sections,
of which the new one is a combination, should be checked for
usefulness, since the only reference in the entire program to one
or both might have been deleted by removal of the two statements.
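The disjointness test and the pairwise combination just described might be sketched as follows. Stack comparisons are modeled as tuples of symbols with the top of the stack at the right end, so "the shorter is identical to the top few symbols of the longer" becomes a suffix test. The function names and the way the label of the new section is built from the pairs are assumptions for illustration; the sketch also handles only the case where both statements activate t-type sections.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]          # (production number π, position p)
Symbols = Tuple[str, ...]       # stack comparison, top of stack last

def not_disjoint(a: Symbols, b: Symbols) -> bool:
    """True if the comparisons are identical or one is a suffix of the other."""
    shorter, longer = sorted((a, b), key=len)
    return longer[len(longer) - len(shorter):] == shorter

def combined_label(pairs: Set[Pair]) -> str:
    # Assumed labeling scheme: list the pairs of the union inside one t(...).
    return "t(" + "; ".join(f"{pi},{p}" for pi, p in sorted(pairs)) + ")"

def combine(a: Symbols, goto_a: str, b: Symbols, goto_b: str,
            descriptors: Dict[str, Set[Pair]]) -> Tuple[Symbols, str]:
    """Replace two conflicting scan statements by one (comparison, new label)."""
    if not not_disjoint(a, b):
        raise ValueError("statements are already disjoint")
    union = descriptors[goto_a] | descriptors[goto_b]
    label = combined_label(union)
    descriptors[label] = union          # the new combination section
    return min(a, b, key=len), label    # keep the shorter stack comparison
```

The caller remains responsible for the two follow-up checks named in the rule: the new section must itself be tested for disjointness, and the old sections should be dropped if nothing refers to them any longer.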

(e) Expansion by Contextual Analysis. The only decisions which cannot be
delayed are those concerning reductions. This limitation is due to the
requirement that reductions be made only at the top of the stack. Thus,
conflicts with statements generated according to mapping rules (3) and
(4) cannot be cured by combination. In this case the statements'
comparison fields are expanded by contextual analysis to provide the parser
with whatever finite look-ahead and look-back are necessary to make
the decision at hand,* i.e.,

for each of the conflicting statements the grammar is
investigated and generation begun of the strings of symbols which,
in the context of the production associated with the statement,
may surround the original stack comparison substring α of the
statement. Appropriate comparison of the composite strings
associated with each of the original statements indicates the
minimum context which must be checked to make the statements
disjoint. In the worst case each statement must be replaced
with several statements which differ from the original in
that they indicate more symbols which must be matched in the
stack and/or the input string.


Examples 


Since the parser proceeds from left to right, always making
reductions at the top of the stack on the basis of whatever finite
look-ahead and look-back are necessary, the algorithm by definition covers all
bounded right context grammars. Further, because the sections of
the program themselves imply certain extra information about the stack
configuration, in the same sense that a state of a finite state acceptor
implies information about the string read, the algorithm also covers
some LR(k) grammars which are not bounded right context. An example
grammar in this class is S ::= aA|bB, A ::= cA|d, B ::= cB|d, the
sentences of which are a c^n d and b c^n d (n ≥ 0). It is not bounded right
context since the clue as to whether to reduce d to A or B is an a or b
arbitrarily far down the stack. The grammar is, however, LR(0) and can
be parsed by the algorithmically generated parser of Figure 1. Note
that a transfer of control to an ERROR routine is implicit at the bottom
of each section in case no match occurs.
START (Sh)    a|    *         Ah
              b|    *         Bh

Ah            c|    *         Ah
              d|    →  A|     At

Bh            c|    *         Bh
              d|    →  B|     Bt

At            cA|   →  A|     At
              aA|   →  S|     St

Bt            cB|   →  B|     Bt
              bB|   →  S|     St

St            SUCCESS EXIT

Figure 1. Algorithmically generated parser for a grammar which is LR(0)
but not bounded right context.
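To make the behavior of the Figure 1 parser concrete, the sketch below interprets its sections in Python. Several details are assumptions made for illustration rather than statements from the paper: the first terminal is taken to be scanned onto the stack before START is entered, a sentence is accepted only if the input is exhausted when section St is reached, and the tuple encoding of statements and the name `parse` are invented here.

```python
from typing import Dict, List, Optional, Tuple

# (stack comparison, action, symbol produced by a reduction, next section)
Stmt = Tuple[str, str, Optional[str], str]

PROGRAM: Dict[str, List[Stmt]] = {
    "START": [("a", "scan", None, "Ah"), ("b", "scan", None, "Bh")],
    "Ah": [("c", "scan", None, "Ah"), ("d", "reduce", "A", "At")],
    "Bh": [("c", "scan", None, "Bh"), ("d", "reduce", "B", "Bt")],
    "At": [("cA", "reduce", "A", "At"), ("aA", "reduce", "S", "St")],
    "Bt": [("cB", "reduce", "B", "Bt"), ("bB", "reduce", "S", "St")],
}

def parse(sentence: str) -> bool:
    """Return True if the section program accepts the sentence."""
    pending = list(sentence)
    if not pending:
        return False
    stack = [pending.pop(0)]        # assumed: initial scan before START
    section = "START"
    while section != "St":          # St is the SUCCESS EXIT section
        for compare, action, result, goto in PROGRAM[section]:
            if "".join(stack[-len(compare):]) == compare:
                break               # first matching statement wins
        else:
            return False            # fell out the bottom: ERROR
        if action == "scan":
            if not pending:
                return False        # nothing left to scan
            stack.append(pending.pop(0))
        else:                       # reduce at the top of the stack
            del stack[-len(compare):]
            stack.append(result)
        section = goto
    return stack == ["S"] and not pending
```

Note how the decision to reduce d to A or to B never inspects the a or b at the bottom of the stack: it is carried entirely by whether control is in section Ah or Bh, which is the sense in which the sections imply extra information about the stack.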


As an example of a grammar requiring both look-ahead and
look-back, consider the following.

 π               p:   1 2 3

 1     S  ::=         c A B
 2     S  ::=         d A e
 3     A  ::=         a G
 4     B  ::=         x e
 5     G  ::=         G x
 6     G  ::=         x

Confusion  arises  in  the  Gt  section  about  when  to  terminate  the  gathering 
of  x's  into  the  non-terminal  G.   Generation  of  the  context  related  to 
production  five  produces  three  possible  strings: 

(1)  G|  →  G|x  →  G|xx

(2)  G|  →  G|x  →  aG|x  →  daG|xe

(3)  G|  →  G|x  →  aG|x  →  caG|xB  →  caG|xxe

There  are  two  possible  strings  for  production  three: 


(1)  aG|  →  daG|e

(2)  aG|  →  caG|B  →  caG|xe


Most  of  the  confusion  is  between  case  (2)  of  production  five  and  case  (2) 
of production three. One possible solution is to construct the following
Gt section:

Gt     G|xx        *         t(5,2)
       daG|xe      *         t(5,2)
       aG|         →  A|     At


Note that advantage has been taken of the sequential nature of the
program here. Since the first two statements will catch all
configurations to which production five is applicable, the statement associated
with production three checks no extra context. That is, the restriction
that the statements in a given section must be disjoint may be relaxed
in special cases where advantage is taken of the order in which statements
are executed; however, the contextual analysis must still be performed to
ascertain the validity of such an optimization. Finally, note that had
production five been G ::= xG the grammar would not have been bounded right
context nor covered by this algorithm, although it would still be LR(2).
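The ordered Gt section can be read as a small decision procedure over the top of the stack (look-back) and the unread input (look-ahead). The sketch below encodes the three statements in order and returns the first that matches. The tuple layout, the function name `gt_decision`, and the representation of stack and input as strings are assumptions for illustration only.

```python
from typing import List, Tuple

# (look-back on the stack, look-ahead in the input, action, next section)
GT_SECTION: List[Tuple[str, str, str, str]] = [
    ("G",   "xx", "scan",        "t(5,2)"),   # more x's follow: keep gathering
    ("daG", "xe", "scan",        "t(5,2)"),   # after d, an x before e joins G
    ("aG",  "",   "reduce to A", "At"),       # otherwise G is complete
]

def gt_decision(stack: str, pending: str) -> Tuple[str, str]:
    """Return (action, next section) for the first matching statement."""
    for look_back, look_ahead, action, goto in GT_SECTION:
        if stack.endswith(look_back) and pending.startswith(look_ahead):
            return action, goto
    return "error", "ERROR"         # control fell out of the section
```

With the stack `caG` and input `xe` the third statement fires, since the remaining xe must form B; with the stack `daG` and the same input the second statement fires, since the x still belongs to G. The third statement checks no extra context only because the first two are tried before it.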
As a final, larger, and more practical example consider the
grammar of Table II, which is Earley's example of a simple algebraic
language. The corresponding list of necessary sections and their
descriptor sets are presented in Table III, and the parser is given
in Figure 2. This grammar requires no special look-back, and it requires
look-ahead of more than one symbol in only one case, section Dt. A single
pair of statements was combined in section Ht, causing the combination
of sections t(1,2) and t(4,2) to form a section labeled t(1,2; 4,2).
Note that such combinations are probably most efficiently effected by
operations on the descriptor sets before the sections are generated.
Also note that maximum advantage was taken of ordering the statements.
However, for expositional purposes several optimizations were not made:
(1) since the first p-1 symbols are matched immediately prior to its
activation, a t(π,p) section need match only the p-th symbol with the
top symbol of the stack, (2) since a reduction to N occurs immediately
prior to the activation of an Nt section, it need not match the top
symbol, and (3) several sections could have been "concatenated", as for
example sections Dt, t(6,2), and t(6,3) which would form

Dt     D|;r    ***       Th
       bD|     →  H|     Ht

Finally, since sections Ph, Fh, and Th are identical, and are a subset
of section Eh, all these could have been combined to save space; however,
this is probably undesirable as it implies a loss of information
useful for error recovery.
PRODUCTION TABLE

                       π     left part

<AXIOM>                0     Z
<BLOCK>                1     B
<HEAD>                 2     H
                       3     H
                       4     H
<DECLARATION>          5     D
                       6     D
<TYPE LIST>            7     Ti
                       8     Ti
<STATEMENT>            9     S
<EXPRESSION>          10     E
                      11     E
                      12     E
<TERM>                13     T
                      14     T
<FACTOR>              15     F
                      16     F
<PRIMARY>             17     P
                      18     P

[The right parts of the productions are not recoverable from the source.]

NOTE:  i is identifier
       r is real
       b is begin
       e is end

Table II. Production table for a simple algebraic language.
NECESSARY SECTIONS          DESCRIPTOR SETS

[Table body not recoverable from the source. The legible section labels
are START, Bh, Dh, Th, Eh, Sh, Fh, Ph, twelve t(π,p) sections, Ht, and Dt;
the descriptor set column is lost.]

Table III. List of necessary sections and the corresponding descriptor
sets for the grammar of Table II.
Figure 2. Algorithmically generated parser for a simple algebraic
language.